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Recent developments have shown that a lot can be gained for QCD simulations from GPU hard- 
ware. This can be exploited especially in the case of Ginsparg-Wilson fermions when the com- 
putational costs are particularly high. In this work, we use the Neuberger-Dirac operator as our 
realisation of Ginsparg-Wilson fermions, which greatly facilitate lattice investigations of decays 
like K — > %n. We report on the ongoing study of our GPU implementation of the Neuberger-Dirac 
operator including the exact treatment of the low lying eigenmodes of the Wilson-Dirac operator. 
Our benchmarks show that we achieve speed-up factors of around 23 and 16 in single and double 
precision, respectively. 
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1. Introduction 

In the past years the GPU, the core of hardware responsible to display images on computer 
screens, became more and more flexible and powerful. The GPU soon surpassed the CPU in terms 
of raw computational power, and various fields in science found a way to exploit this (see also the 
review in [p]]). For the lattice community, ref. ^ set out one ground-breaking paper in which it 
was demonstrated how to map the Wilson-Dirac operator on the GPU. With the introduction of 
NVIDIA CUDA, a flexible programming model was given and made it even easier to access the 
ever-increasing computational power. This was the basis for the reference paper [Bj. 

The GPU approach could be particularly important when it comes to simulations with Ginsparg- 
Wilson fermions [Q] which possess an order-of-magnitude higher computational costs than Wilson 
fermions. 

One particular solution to the Ginsparg-Wilson relation is the Neuberger-Dirac operator which 
we are studying in this paper. A lattice Dirac operator which satisfies chiral symmetry at nonzero 
lattice spacing greatly facilitates the investigation of certain processes like non-leptonic kaon de- 
cays. These decays raise the long-standing problem for QCD phenomenology why the AI = 1/2 
amplitudes are so much larger than the AI = 3/2 amplitudes. The bulk of the enhancement must 
be due to strong interactions ^ at low energies and therefore, a reliable explanation must be based 
on systematic non-perturbative methods, in particular on lattice QCD [g]. 

So far, a lot of effort has been invested to understand and quantify the effects responsible 
for the enhancement of the AI = 1/2 amplitudes. A series of simulations were performed using 
quenched calculations in the SU(4)-symmetric case in order to understand the effect of the charm 
quark [01 . One particular approach is to do simulations outside the GIM-limit with an active charm 
quark and reasonable large volumes to study finite-volume effect. 

In this paper we report on the ongoing development of our tools for numerical simulations in 
the £-regime utilizing GPU hardware for acceleration. We show details of the theoretical back- 
ground and implementation for our Wilson-Dirac kernel and for the Neuberger-Dirac operator. We 
give performance results for both operators and the calculation of low lying eigenmodes of the 
Wilson-Dirac operator. At the end we give an outlook about planned work. 

2. Implementation details 

2.1 The Wilson-Dirac operator 

Following Wilsons formulation for lattice QCD, the y 5 -Hermitian, massive Wilson-Dirac op- 
erator Q used in our implementation is defined by 

1 ±3 

D w 0(*) = (4 + m)<Kx)-- £ £/ M (x)(l-y M )0(*+a£), Q = y 5 D w , (2.1) 

1 /i=±0 

where m is the bare mass of the fermion. Gauge links (x) and the Dirac matrices y^ follow the 
definition 

U^fx (x) =U~ l (x-ap.) and y_ fl = -y fl . (2.2) 
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The most important part in the implementation of the operator is the data layout. Data read and 
written to global device memory on the GPU has to fulfill several constraints, known as coalescing 
rules []|], to ensure maximum performance. One of them requires that consecutive threads read 
data words in sequential order. A data word can have 32-, 64- or 128-Bytes. In [§] a data layout 
for the fermions fields and the gauge links was introduced which is nowadays considered as the 
default and our implementation follows that, too. 

Since the Wilson-Dirac kernel is memory-bound, we intend to minimize memory access. It is 
possible to save memory load instructions by realizing that one does not need the whole 18 entries 
of the gauge link matrices as they fulfill the orthogonalization relation U J U = 1 of the SU(3) gauge 
group. In total, we need a minimum of 8 parameters to uniquely define an element of the gauge 
links. Each element of SU(3) can be expressed as a linear combination of Gell-Mann-matrices, and 
in principle one could choose the coefficients in this linear combination as parameters. However, 
although the reconstruction comes practically for free, we should not introduce any overhead in 
the reconstruction. A reconstruction based on Gell-Mann matrices is not beneficial in terms of 
numerical effort. A much better approach is presented in ^ ^] which also has the advantage, 
that the inverse operation, i.e. finding the parameters for a given SU(3)-matrix, is simple. Another 
simple reconstruction scheme is based on the property 



U= a 1 b 1 c 2 \ c = (axb)*, (2.3) 




that one column of the matrix can be reconstructed by the other two. 

The rest of the implementation is fairly straightforward since there is no communication nec- 
essary if we assign each lattice site of the resulting fermion field a single thread. We explicitly write 
down the multiplication in Dirac-space of the eight directions pL = ±0 . . . ± 3, unroll the matrix- 
vector multiplication in color space after reconstruction and accumulate the resulting fermion in 
local registers. After the multiplication with y 5 , we write the result to global device memory. 

2.2 The Neuberger-Dirac operator 



Following the conventions and notations from [10], the Neuberger-Dirac operator Dn [11] can 
be defined in terms of the Wilson-Dirac operator Dw 

DN= 1 + Bsign(0 G = B(aDw _ 1 _, )< (2 . 4) 
a 

In this definition, \s\ < 1 is a tunable parameter while the sign-function has to be defined by its 
series expansion. For numerical stability we choose Chebychev polynomials, so the sign-function 
of the operator Q is given by 

sign(Q)~XP n {X 2 ), X=JL and P n (x) = £ c k T k {x). (2.5) 

The definition of the Chebychev polynomials T k iy) can be found in [|l2|]. Because Chebychev poly- 
nomials fulfill a recursion relation they permit an evaluation via the Clenshaw recurrence formula. 
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In order to find the coefficients q of the expansion, we aim to minimize the error 

8 = max |%)|, hiy) = 1 - y/yP(y) (2.6) 

e<y<l 

of the polynomial expansion for specified e. In the range \fe < \x\ < 1 the function xP(x 2 ) then 
approximates sign(x) uniformly with a maximal deviation 8. Polynomials constructed in such a 
way are often referred to as minmax polynomials. If e is chosen such that Q 2 > £||<2|| 2 , the error in 
( |2"3| ) is an operator with norm less than or equal to 8. The approximation error is always bounded 
by 5||t] ||, uniformly in the field v\ to which the operator is applied. 

This method is straightforward, but in the case of the £ -regime actually not recommended. In 
this case, the operator Q 2 may have some exceptionally low-lying eigenvalues, and it is far more 
efficient first to separate the few lowest modes and to treat them exactly. 

2.3 Low-mode projection 

The spectrum of Q in the vicinity of the origin can be reliably determined by minimizing the 



Ritz functional of Q [13]. Thereby, we also find an approximation of the associated eigenvectors. 



In order to control the error of the total approximation, we need to examine by how much these 



vectors deviate from the true eigenvector [10] 



Assuming that a specified number / of approximate eigenvectors has been computed, we de- 
note the linear space spanned by these vectors by V and by P the corresponding orthonormal pro- 
jector. We shall take it for granted that the eigenvalues v k of Q are separated from zero and from the 
rest of the spectrum by a distance greater than p . This number measures the deviation of V from 
being an exact eigenspace of Q. The presence of a spectral gap around zero implies that the subset 
of positive and negative eigenvalues can be identified without any numerical ambiguity. With the 
introduction of the eigenvectors u k £ V, FQuu = v k u k , the associated orthonormal projectors are 
given by 

P+ = £ u k ®(u k ) f and P_=^«^(^) t (2.7) 
v k >0 v t <0 

respectively. In the same way we may define the projectors (P±) eX act to the subspaces spanned by 
the corresponding exact eigenvectors of Q with positive and negative eigenvalues X k . We can give 
a quantitative estimate on the deviation of the computed projectors P± from the exact projectors 
(P±) exact- The total error depends on the size of the residues 

Pk = \\{Q-v k )u k \\ (2.8) 

and also on the distance between the eigenvalues, as a small value of p k does not exclude sizeable 
mixing of u k with several eigenvectors of Q if these have eigenvalues that are within a distance p k of 
v k . When estimating the deviation of the projectors rather than that of the individual eigenvectors, 
the interesting quantity is the distance d k of v k from the exact spectrum of Q in the subspace that is 
orthogonal to the range of (P+) eX act if > if (P-)exact if Vk < 0. The quality of the approximation 
is then controlled by the parameters 

4=£ Pk/4, *±>0 (2.9) 

±v k >0 
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GTX285 GTX480 C2050 



No. of Cores 


240 


480 


448 


Memory amount [MB] 


2048 


1536 


3072 


Shader Clock [MHz] 


1476 


1401 


1150 


Memory Clock [MHz] 


1242 


1848 


1500 


Memory Bandwidth [GB/s] 


159.0 


177.4 


144.0 



Table 1: Key specification of the three NVIDIA GPUs available for benchmarking. For the C2050, roughly 
10% of total memory amount has to be subtracted if the error correction ECC is enabled by the driver. 



and an upper bound for the deviation of the projectors can by given by 

r*-ff*wn< ,_tl(, 2 :l) - (210) 

In our implementation, the whole computation of the projectors is performed on the GPU. The 
number of low modes included in the projector will be determined dynamically in such a way that 
the spectral distance from the other modes is not accidentally small. The parameters K± can be 
estimated without difficulty and the Ritz functional is stopped when the desired level of precision 



is reached. Afterwards, we replace Equation 2.5 by 



sign(e) ~ P+ - P_ + (1 - P+ - P_)XP„(X 2 ), (2.1 1) 

and it can be shown that the approximation to the massive Neuberger-Dirac operator satisfies 

m-D^\\<- a {l+s- a -^)(K + + K_). (2.12) 

3. Results 



For performance results we have tested the GeForce GTX285 GPU, a current generation card, 
and the GeForce GTX480, as well as a Tesla C2050 GPU, the latter of which are both based on the 
recently released Fermi chipset. The code, however, was not optimized for the new architecture 
and we expect further enhancements in the future. An overview of the key features of each GPU is 
given in Table [I]. 

We have already stated that the performance of the Wilson-Dirac kernel is memory bound. In 
our optimization we aimed for maximizing the total memory bandwidth achieved by our imple- 
mentation. This also means that we should not expect strong scaling as the number of cores is 
increased. The major contribution to the performance of the calculation comes from the memory 
bandwidth. 

In Figure [T] the performance of the Wilson-Dirac kernel in single- and double precision is 
given as a function of the lattice volume. The double precision data curve on the C2050 is missing 
because we had only limited access to the GPU and it was not available anymore at the time of our 
testings. 

We can see a nearly constant behaviour for sufficient lattice volumes. On the GTX285 we 
observe a small peak at T = 64 which we believe is largely accidental. For double precision we see 
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Figure 1: Performance result for the application of the massive Wilson-Dirac operator Q. LEFT: Sin- 
gle precision performance. RIGHT: Double precision performance. The spatial extent L is fixed at 24 3 
throughout. 




Time extent T Lattice volume 



Figure 2: LEFT: Performance result of the application of the Neuberger-Dirac operator D as compared to 
the SSE2-optimized CPU version. The spatial extend L is fixed at 24 3 . RIGHT: Performance result for the 
calculation of low-lying eigenmodes of the Wilson-Dirac operator Q. The plots show the speed-up factor 
Tsse/tgpu m single and double precision, respectively. 



a decrease of the performance by a factor of around 8. On the GTX285, this comes from the fact 
that there is only 1 double precision unit while 8 single precision units are available. In principle, 
the architectural design of the GTX480 should give up to half of the single precision speed in 
double precision. NVIDIA, however, restricts this to a quarter of the single precision speed inside 
the driver. This drop in performance can be explained by the fact that we did not optimize for 
the new chipset. The final result for the Wilson-Dirac kernel gives around 140GFlop/s on the 
GTX285, 150GFlop/s on the C2050 and 170GFlop/s on the GTX480 in single precision. For 
double precision we achieve 22GFlop/s and 25GFlop/s on the GTX285 and GTX480 respectively. 

In Figure || the performance of the Neuberger-Dirac operator in single and double precision is 
illustrated as a comparison to the SSE2-optimized CPU version. The execution time of the operator 
is normalized on the number of Clenshaw iterations and the lattice volume. The CPU version was 
run on an Intel E6400 Core2 with a single core clock rate of 2.13 GHz. The GPU version was run 
on the GTX285. We can observe a nearly constant behaviour in the execution time in dependence 
to the lattice volume. For single precision, the SSE2 version has an average of Tsse = 0.877 jj.s and 



6 



The Neuberger-Dirac operator on the GPU 



Bjoern Walk 



the GPU version of Tqpu = 0.038 /is, giving a speedup factor of around 23. For double precision, 
the numbers read Tsse = 1-459 /is and Tgpu = 0.090 /is, hence, a speedup factor of around 16. 

4. Conclusions and outlook 

In this proceedings article, we gave a first introduction of our study of a GPU-based simu- 
lation program for lattice QCD based on the Neuberger-Dirac operator. We introduced the basic 
theoretical background and gave first performance results of our implementation of the Wilson- 
Dirac operator and the Neuberger-Dirac operator. We have shown that the performance for the 
Wilson-Dirac operator is of the same order of magnitude compared to other implementations pre- 
viously published. For the Neuberger-Dirac operator, we have shown that we can reach a speedup 
factor to the Wilson-Dirac operator of around 23 on single precision and 16 on double precision. 
We are going to develop code for the calculation of the index of our Neuberger-Dirac operator via 
zero-mode counting [^]. In the long run, we aim to integrate our code into a larger set of simulation 
programs to calculate observables in the process K — » %% in order to increase the lattice volume on 
those calculations. 
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